29 research outputs found

    Demographic Inference and Representative Population Estimates from Multilingual Social Media Data

    Social media provide access to behavioural data at an unprecedented scale and granularity. However, using these data to understand phenomena in a broader population is difficult due to their non-representativeness and the bias of statistical inference tools towards dominant languages and groups. While demographic attribute inference could be used to mitigate such bias, current techniques are almost entirely monolingual and fail to work in a global environment. We address these challenges by combining multilingual demographic inference with post-stratification to create a more representative population sample. To learn demographic attributes, we create a new multimodal deep neural architecture for joint classification of age, gender, and organization-status of social media users that operates in 32 languages. This method substantially outperforms the current state of the art while also reducing algorithmic bias. To correct for sampling biases, we propose fully interpretable multilevel regression methods that estimate inclusion probabilities from inferred joint population counts and ground-truth population counts. In a large experiment over multilingual, heterogeneous European regions, we show that our demographic inference and bias correction together allow for more accurate population estimates and make a significant step towards representative social sensing in downstream applications with multilingual social media. Comment: 12 pages, 10 figures, Proceedings of the 2019 World Wide Web Conference (WWW '19).
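    The post-stratification step described in this abstract can be pictured with a small sketch. The snippet below is a simplified illustration, not the paper's multilevel regression method: it assumes demographic attributes (age band, gender) have already been inferred for each sampled user and reweights users so that weighted cell totals match ground-truth census counts. All names and numbers are hypothetical.

```python
# Hypothetical post-stratification sketch: reweight a social-media sample using
# inferred demographic cells and ground-truth census counts. The paper combines
# this idea with multilevel regression; here we only show plain cell weighting.
from collections import Counter

def poststratify(inferred_cells, census_counts):
    """inferred_cells: list of (age_band, gender) tuples, one per sampled user,
    as predicted by the demographic classifier.
    census_counts: dict mapping (age_band, gender) -> ground-truth population count.
    Returns a per-user weight so that weighted cell totals match the census."""
    sample_counts = Counter(inferred_cells)
    weights = []
    for cell in inferred_cells:
        # Inclusion probability of this cell ~ sampled users / population in the cell.
        inclusion_prob = sample_counts[cell] / census_counts[cell]
        weights.append(1.0 / inclusion_prob)  # inverse-probability weight
    return weights

# Toy usage with made-up numbers.
sample = [("18-29", "F"), ("18-29", "F"), ("30-49", "M")]
census = {("18-29", "F"): 1000, ("30-49", "M"): 3000}
print(poststratify(sample, census))  # [500.0, 500.0, 3000.0]
```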

    The importance of neutral examples for learning sentiment

    Most research on learning to identify sentiment ignores “neutral” examples, learning only from examples of significant (positive or negative) polarity. We show that it is crucial to use neutral examples in learning polarity for a variety of reasons. Learning from negative and positive examples alone will not permit accurate classification of neutral examples. Moreover, the use of neutral training examples in learning facilitates better distinction between positive and negative examples.

    Authorship Verification as a one-class classification problem

    In the authorship verification problem, we are given examples of the writing of a single author and are asked to determine if given long texts were or were not written by this author. We present a new learning-based method for adducing the “depth of difference” between two example sets and offer evidence that this method solves the authorship verification problem with very high accuracy. The underlying idea is to test the rate of degradation of the accuracy of learned models as the best features are iteratively dropped from the learning process.
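    The feature-dropping idea in this abstract can be sketched in a few lines. The snippet below is a rough, hypothetical reconstruction, assuming the two bodies of writing are split into text chunks; the feature count, classifier, and number of rounds are illustrative choices, not the paper's exact setup.

```python
# Rough sketch of the degradation test described above: repeatedly train a classifier
# to tell the two example sets apart, then drop its strongest features and retrain.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def degradation_curve(chunks_a, chunks_b, rounds=10, drop_per_round=3):
    """chunks_a / chunks_b: lists of text chunks from the two bodies of writing.
    Returns cross-validated accuracies as the strongest features are removed each round."""
    y = np.array([0] * len(chunks_a) + [1] * len(chunks_b))
    X = CountVectorizer(max_features=250).fit_transform(chunks_a + chunks_b).toarray()
    active = np.arange(X.shape[1])  # indices of features still in play
    curve = []
    for _ in range(rounds):
        curve.append(cross_val_score(LinearSVC(), X[:, active], y, cv=3).mean())
        clf = LinearSVC().fit(X[:, active], y)
        # Drop the features the model leans on most heavily (largest |weight|).
        strongest = np.argsort(-np.abs(clf.coef_[0]))[:drop_per_round]
        active = np.delete(active, strongest)
    # A curve that collapses quickly suggests the two sets differ only superficially,
    # i.e. they are more likely to be by the same author.
    return curve
```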

    Using neutral examples for learning polarity

    Sentiment analysis is an example of polarity learning. Most research on learning to identify sentiment ignores “neutral” examples and instead performs training and testing using only examples of significant polarity. We show that it is crucial to use neutral examples in learning polarity for a variety of reasons and show how neutral examples help us obtain superior classification results in two sentiment analysis test-beds. Many machine-learning problems involve predicting an example’s polarity: is it (significantly) greater than or less than some standard? One canonical example of learning polarity is sentiment analysis, the determination of whether a particular text expresses positive or negative sentiment regarding some issue. The problem of how to exploit a labeled corpus to learn models for sentiment analysis has attracted a good deal of interest in recent years [Dave et al. 2003, Pang et al. 2002, Shanahan et al. 2005]. One common characteristic of almost all this work has been the tendency to define the task as a two-category problem: positive versus negative. In almost all actual polarity problems, including sentiment analysis, there are, however, three categories that must be distinguished: positive, negative and neutral. Not every comment on a product or experience expresses purely positive or negative sentiment. Some – in many cases, most – comments might report objective facts without expressing any sentiment, while others might express mixed or conflicting sentiment. Researchers are aware, of course, of the existence of neutral documents. The rationale for ignoring them has been a reliance on two tacit assumptions:
    • Solving the binary positive vs. negative problem automatically solves the three-category problem, since neutral documents will simply lie near the boundary of the binary model.
    • There is less to learn from neutral documents than from documents with clearly defined sentiment.
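    As a minimal illustration of the argument above, the toy snippet below trains a three-way (positive / negative / neutral) text classifier rather than a binary one; the texts, labels, and model choice are invented for the example and are not from the paper's test-beds.

```python
# Toy three-class sentiment example: including an explicit "neutral" class lets the
# model learn a boundary around "no sentiment" instead of forcing a pos/neg choice.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "great product, works perfectly",     # positive
    "terrible, broke after one day",      # negative
    "the package arrived on tuesday",     # neutral: objective fact, no sentiment
    "i love the battery life",            # positive
    "worst purchase i have ever made",    # negative
    "it comes in three colours",          # neutral
]
labels = ["pos", "neg", "neu", "pos", "neg", "neu"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
print(clf.predict(["delivery was on time", "absolutely fantastic"]))
```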

    Exploiting Stylistic Idiosyncrasies for Authorship Attribution

    Early researchers in authorship attribution used a variety of statistical methods to identify stylistic discriminators: characteristics which remain approximately invariant within the works of a given author but which tend to vary from author to author (Holmes 1998, McEnery & Oakes 2000). In recent years machine learning methods have been applied to authorship attribution. A few examples include (Matthews & Merriam 1993, Holmes & Forsyth 1995, Stamatatos et al. 2001, de Vel et al. 2001). Both the earlier "stylometric" work and the more recent machine-learning work have tended to focus on initial sets of candidate discriminators which are fairly ubiquitous. For example, the classical work of Mosteller and Wallace (1964) on the Federalist Papers used a set of several hundred function words, that is, words that are context-independent and hence unlikely to be biased towards specific topics. Other features used in even earlier work (Yule 1938) are complexity-based.
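    A tiny sketch of the function-word idea mentioned above: represent a text by the relative frequencies of topic-independent words. The word list and sample text below are illustrative, not the actual discriminator sets used in the cited studies.

```python
# Hypothetical function-word profile (Mosteller & Wallace style): relative frequencies
# of topic-independent words, roughly invariant to subject matter.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "upon", "while", "whilst"]

def function_word_profile(text):
    # Lowercase, split on whitespace, and strip simple punctuation.
    tokens = [t.strip(".,;:!?\"'") for t in text.lower().split()]
    total = len(tokens) or 1
    return [tokens.count(w) / total for w in FUNCTION_WORDS]

print(function_word_profile("Upon the whole, the evidence of the papers points to the author."))
```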